Set the working directory, clear all existing objects in the workspace and set the seed for reproducibility. Lastly, load the necessary libraries.
# Set the working directory
setwd("./")
# Clear all existing objects in the workspace
rm(list = ls())
# Set the seed for reproducible results
set.seed(1009)
# Load libraries
library(dplyr)
library(ggplot2)
library(ggmap)
library(scales)
library(RColorBrewer)
train.values <- read.csv("../data/clean-training-set-values.csv", header = TRUE, na.strings = "NA")
train.values$population[train.values$population == 0] <- NA
train.labels <- read.csv("../data/src-training-set-labels.csv", header = TRUE)
For the analysis, the training values and labels will be merged, and the exploration will be conducted on the combined dataset. ## Merge the training values and labels
training.data <- merge(train.values, train.labels)
The following variables contain missing values:
summary(training.data)
## id amount_tsh date_recorded funder
## Min. : 0 Min. : 0.0 2011-03-15: 572 Government Of Tanzania: 9084
## 1st Qu.:18520 1st Qu.: 0.0 2011-03-17: 558 Danida : 3114
## Median :37062 Median : 0.0 2013-02-03: 546 Hesawa : 2202
## Mean :37115 Mean : 317.7 2011-03-14: 520 Rwssp : 1374
## 3rd Qu.:55656 3rd Qu.: 20.0 2011-03-16: 513 World Bank : 1349
## Max. :74247 Max. :350000.0 2011-03-18: 497 (Other) :37863
## (Other) :56194 NA's : 4414
## gps_height installer longitude latitude wpt_name
## Min. : -90.0 DWE :17402 Min. : 0.00 Min. :-11.649 Shuleni : 1748
## 1st Qu.: 0.0 Government: 1825 1st Qu.:33.09 1st Qu.: -8.541 Zahanati : 830
## Median : 369.0 RWE : 1206 Median :34.91 Median : -5.022 Msikitini: 535
## Mean : 668.3 Commu : 1060 Mean :34.08 Mean : -5.706 Kanisani : 323
## 3rd Qu.:1319.2 DANIDA : 1050 3rd Qu.:37.18 3rd Qu.: -3.326 Bombani : 271
## Max. :2770.0 (Other) :32422 Max. :40.35 Max. : 0.000 (Other) :52128
## NA's : 4435 NA's : 3565
## num_private basin subvillage region
## Min. : 0.0000 Lake Victoria :10248 Madukani: 508 Iringa : 5294
## 1st Qu.: 0.0000 Pangani : 8940 Shuleni : 506 Shinyanga : 4982
## Median : 0.0000 Rufiji : 7976 Majengo : 502 Mbeya : 4639
## Mean : 0.4741 Internal : 7785 Kati : 373 Kilimanjaro: 4379
## 3rd Qu.: 0.0000 Lake Tanganyika: 6432 Mtakuja : 262 Morogoro : 4006
## Max. :1776.0000 Wami / Ruvu : 5987 (Other) :56878 Arusha : 3350
## (Other) :12032 NA's : 371 (Other) :32750
## region_code district_code lga ward population
## Min. : 1.0 Min. : 0.00 Njombe : 2503 Igosi : 307 Min. : 1.0
## 1st Qu.: 5.0 1st Qu.: 2.00 Arusha Rural: 1252 Imalinyi : 252 1st Qu.: 40.0
## Median :12.0 Median : 3.00 Moshi Rural : 1251 Siha Kati: 232 Median : 150.0
## Mean :15.3 Mean : 5.63 Bariadi : 1177 Mdandu : 231 Mean : 281.1
## 3rd Qu.:17.0 3rd Qu.: 5.00 Rungwe : 1106 Nduruma : 217 3rd Qu.: 324.0
## Max. :99.0 Max. :80.00 Kilosa : 1094 Kitunda : 203 Max. :30500.0
## (Other) :51017 (Other) :57958 NA's :21381
## public_meeting recorded_by scheme_management scheme_name
## False: 5055 GeoData Consultants Ltd:59400 VWC :36793 K : 682
## True :51011 WUG : 5206 None : 644
## NA's : 3334 Water authority: 3153 Borehole : 546
## WUA : 2883 Chalinze wate: 405
## Water Board : 2748 M : 400
## (Other) : 4740 (Other) :28557
## NA's : 3877 NA's :28166
## permit construction_year extraction_type extraction_type_group
## False:17492 Min. :1960 gravity :26780 gravity :26780
## True :38852 1st Qu.:1987 nira/tanira: 8154 nira/tanira: 8154
## NA's : 3056 Median :2000 other : 6430 other : 6430
## Mean :1997 submersible: 4764 submersible: 6179
## 3rd Qu.:2008 swn 80 : 3670 swn 80 : 3670
## Max. :2013 mono : 2865 mono : 2865
## NA's :20709 (Other) : 6737 (Other) : 5322
## extraction_type_class management management_group
## gravity :26780 vwc :40507 commercial: 3638
## handpump :16456 wug : 6515 other : 943
## motorpump : 2987 water board : 2933 parastatal: 1768
## other : 6430 wua : 2535 unknown : 561
## rope pump : 451 private operator: 1971 user-group:52490
## submersible : 6179 parastatal : 1768
## wind-powered: 117 (Other) : 3171
## payment payment_type water_quality quality_group
## never pay :25348 annually : 3642 soft :50818 colored : 490
## other : 1054 monthly : 8300 salty : 4856 fluoride: 217
## pay annually : 3642 never pay :25348 unknown : 1876 good :50818
## pay monthly : 8300 on failure: 3914 milky : 804 milky : 804
## pay per bucket : 8985 other : 1054 coloured : 490 salty : 5195
## pay when scheme fails: 3914 per bucket: 8985 salty abandoned: 339 unknown : 1876
## unknown : 8157 unknown : 8157 (Other) : 217
## quantity quantity_group source
## dry : 6246 dry : 6246 spring :17021
## enough :33186 enough :33186 shallow well :16824
## insufficient:15129 insufficient:15129 machine dbh :11075
## seasonal : 4050 seasonal : 4050 river : 9612
## unknown : 789 unknown : 789 rainwater harvesting: 2295
## hand dtw : 874
## (Other) : 1699
## source_type source_class waterpoint_type
## borehole :11949 groundwater:45794 cattle trough : 116
## dam : 656 surface :13328 communal standpipe :28522
## other : 278 unknown : 278 communal standpipe multiple: 6103
## rainwater harvesting: 2295 dam : 7
## river/lake :10377 hand pump :17488
## shallow well :16824 improved spring : 784
## spring :17021 other : 6380
## waterpoint_type_group status_group
## cattle trough : 116 functional :32259
## communal standpipe:34625 functional needs repair: 4317
## dam : 7 non functional :22824
## hand pump :17488
## improved spring : 784
## other : 6380
##
To begin, pump locations are plotted on a map to get a sense of where they are located and determine if they are clustered in certain parts of the country. The pumps, regardless of status, are generally located in the same areas. Although, the Southern tip of Tanzania shows a higher concentration of non-functional pumps.
# Center of Tanzania
map <- get_map(c(lon = 34.888822, lat = -6.369028), source = "google", zoom = 6)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=-6.369028,34.888822&zoom=6&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
ggmap(map) +
geom_point(aes(x = longitude, y = latitude, colour = status_group), data = training.data, alpha = 1/2) +
scale_colour_brewer(palette = "Set1", name = "Status Group")
An inspection of the pump status group shows that just over half of the water pumps are operational at 54.3%.
ggplot(data=training.data, aes(x=status_group)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill=brewer.pal(3, "Set1"), alpha = 1/2) +
geom_text(aes(y = ((..count..)/sum(..count..)), label = scales::percent((..count..)/sum(..count..))),
stat = "count", vjust = -0.25) +
scale_y_continuous(labels = percent) +
theme(axis.text.y=element_blank(), axis.ticks=element_blank(), axis.title.y=element_blank()) +
theme(plot.title = element_text(hjust = 0.5, face="bold")) +
xlab("Operational Status") + ylab("Percent")
The next few plots compare categorical variables by pump status groups.
Ploting the water pumps by basin and separating them by status shows that the Pangani basin has the most functional water pumps. The Lake Victoria basin has the highest quantity of water pumps that require repair or are not functional. In addition, Lake Rukwa, Lake Tanganyika, and Ruvuma / Southern Coast have more non-functional pumps than functional pumps.
ggplot(training.data, aes(x=basin, fill = basin)) + geom_bar(alpha = 1/2) +
scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Basin"))
Comparing the region plot to the region map illustrates that regions where the quantity of non-functional pumps exceed functional pumps are not clustered in a specific part of the country. These regions include Lindi, Mara, Mtwara, Rukwa, and Tabora.
ggplot(training.data, aes(x=region, fill = status_group)) +
geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Status Group"))
Figure 1. Tanzania region map. Reprinted from Tanzania regions (para. 1).
A potential hypothesis is that pumps without permits may not be built to the required standards, and therefore may have a higher amount of pumps needing repair or that are non-funtional. However, at the following plot depicts, for each pump status group, permitted pumps outnumber non-permitted pumps.
# Filter out observations with NA
permits <- training.data %>% select(permit, status_group) %>%
filter(! is.na(permit))
ggplot(permits, aes(x=permit, fill = permit)) + geom_bar(alpha = 1/2) +
scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Permit"))
Could the pump extraction type class contribute to the pump status group? As exhibited in the following plot, gravity and handpump extraction have a significantly higher level of functional pumps. In contrast, extraction types of motorpump and other cleary are less reliable.
ggplot(training.data, aes(x=extraction_type_class, fill = extraction_type_class)) +
geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Extraction Type Class"))
Is there a relationship between the water quality group and pump status? The following plot demonstrates that the overwhelming majority of pumps are classified as having good water. Only in the case of the water quality group of “unknown”, are there more non-fuctional pumps than functional pumps.
ggplot(training.data, aes(x=quality_group, fill = quality_group)) +
geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Quality"))
The quantity of water available to a pump, or the lack thereof, could render a pump non-functional. As the following plot demonstrates, pumps where the water quantity category is dry, have the lowest level of functional pumps.
ggplot(training.data, aes(x=quantity, fill = quantity)) +
geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Quantity"))
A visual inspection of the water source shows that spring and shallow well account for the majority of water sources. In addition, upon visual inspection, both categories have the most functional and non-functional pumps.
# Extend the Set1 palette
#set1.plus1 <- colorRampPalette(brewer.pal(9, "Set1"))(10)
# ggplot(training.data, aes(x=source, fill = source)) +
# geom_bar(alpha = 1/2) + scale_fill_manual(values = set1.plus1) + facet_grid(~status_group) +
# theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
# xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Source"))
ggplot(training.data, aes(x=source, fill = status_group)) +
geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Status Group"))
Examining the water point types provide a discernible majority in the communal standpipe and hand pump categories. Across all categories, only communal standpipe multiple and other have more non-functional pumps compared to functional pumps.
ggplot(training.data, aes(x=waterpoint_type, fill = waterpoint_type)) +
geom_bar(alpha = 1/2) + scale_fill_brewer(palette="Set1") + facet_grid(~status_group) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
xlab("") + ylab("Pump Count") + guides(fill=guide_legend(title="Water Point Type"))
The next few plots compare continuous and discrete variables by pump status groups.
The popoluation attribute only contains 39,019 (64.0%) observations with a value greater than zero. In addition the data is signifcantly skewed. Taking the natural log of the population aids in correcting the skew into a somewhat normal bell shape. Future iterations will address options for the observations containing zero.
# Filter out observations with NA
population <- training.data %>% select(population, status_group) %>%
filter(! is.na(population))
dim(population)
## [1] 38019 2
ggplot(population, aes(x = log(population))) +
geom_histogram(bins = 30, fill="#377EB8", alpha = 1/2) +
facet_grid( ~ status_group) + xlab("Population") + ylab("")
The total static head amount variable also contains highly skewed value and a large amount of outliers.
ggplot(training.data, aes(x=status_group, y=amount_tsh)) +
geom_boxplot(fill="#377EB8") + coord_flip() + xlab("") + ylab("")
Each pump status displays a left-skewed histogram. Logically, this is reasonable since more pumps will exist over time as populations increase.
# Filter out observations with NA
constr.year <- training.data %>% select(construction_year, status_group) %>%
filter(! is.na(construction_year))
ggplot(constr.year, aes(x = construction_year)) +
geom_histogram(bins = 20, fill="#377EB8", alpha = 1/2) +
facet_grid( ~ status_group) + xlab("Year") + ylab("")
ggplot(training.data, aes(x = gps_height)) +
geom_histogram(bins = 30, fill="#377EB8", alpha = 1/2) +
facet_grid( ~ status_group) + xlab("Height") + ylab("")
Wikimedia Commons. (2012). Tanzania regions. Retrieved from https://commons.wikimedia.org/wiki/File:Tanzania_regions.svg